Homework 3: Wikipedia Clustering

Authors

Cliff Engle

Antonio Lupher

Abstract

Clustering is an important machine learning task: partitioning data into distinct groups based on their features. An ideal clustering algorithm maximizes feature similarity within a cluster while minimizing feature similarity across clusters. Common clustering algorithms include spectral clustering and k-means clustering. This project consists of two main parts: extracting features into a bag-of-words representation, and then clustering on those features.
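The two parts named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy documents, the helper names `bag_of_words`, `sq_dist`, and `kmeans`, and the hand-picked initial centroids are all assumptions made for illustration, and a real Wikipedia-scale pipeline would need sparse matrices and a proper initialization strategy.

```python
import re
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for a list of text documents."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0.0] * len(vocab)
        for tok, n in Counter(toks).items():
            vec[index[tok]] = float(n)
        vectors.append(vec)
    return vocab, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, init_indices, max_iter=100):
    """Lloyd's algorithm; initial centroids are the vectors at init_indices
    (a hypothetical, hand-picked initialization to keep the sketch deterministic)."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy corpus (invented for illustration, not taken from Wikipedia).
docs = [
    "cats and dogs are pets",
    "dogs and cats are animals",
    "stocks and bonds are investments",
    "bonds and stocks are assets",
]
vocab, vectors = bag_of_words(docs)
labels = kmeans(vectors, init_indices=[0, 2])
print(labels)
```

On this toy corpus the two animal documents share one label and the two finance documents share the other, which is the within-cluster similarity the abstract describes.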

Similar resources

CS 294 - 1 Homework 3 Timothy Hunter and Andre

In this assignment, the goal was to parse a large set of Wikipedia articles, to extract features into a sparse feature matrix and to cluster them with a clustering algorithm of our choice. One motivation to perform automated clustering on unstructured, unlabeled data is to detect correlations between data points; for instance, in the case of Wikipedia, one might be able to automatically group a...

Categorization of Wikipedia Articles with Spectral Clustering

The article reports the application of clustering algorithms to create hierarchical groups within Wikipedia articles. We evaluate three spectral clustering algorithms on datasets constructed using Wikipedia categories. The selected algorithm has been implemented in a system that categorizes Wikipedia search results on the fly.

CSCE 313-200: Computer Systems

The goal of this project is to search for a given set of substrings in English Wikipedia, which exists on beefybox in four versions – tiny (50 MB), small (512 MB), medium (8 GB), and complete (28 GB). While Wikipedia does contain some UTF-8 characters, all target substrings in this homework are US ASCII (i.e., byte values below 128), which means that you will not have to perform any conversion ...

Multilingual Document Clustering Using Wikipedia as External Knowledge

This paper presents Multilingual Document Clustering (MDC) on comparable corpora. Wikipedia, a structured multilingual knowledge base, has been highly exploited in many monolingual clustering approaches and also in comparing multilingual corpora. But there is no prior work which studied the impact of Wikipedia on MDC. Here, we have made an in-depth study on availing Wikipedia in enhancing MDC p...

Evaluating the Performance of XML Document Clustering by Structure Only

This paper reports the results and experiments performed on the INEX 2006 Document Mining Challenge Corpus with the PCXSS clustering method. The PCXSS method is a progressive clustering method that computes the similarity between a new XML document and existing clusters by considering the structures within documents. We conducted the clustering task on the INEX and Wikipedia data sets.

Publication date: 2012
